Abstract

In order to assess the short-term memory performance of non-linear random neural networks, we
introduce a measure to quantify the dependence of a neural representation upon the past context.
We study this measure both numerically and theoretically using the mean-field theory for random
neural networks, showing the existence of an optimal level of synaptic weight heterogeneity. We
further investigate the influence of the network topology, in particular the symmetry of reciprocal
synaptic connections, on this measure of context dependence, revealing the importance of
considering the interplay between non-linearities and connectivity structure.

Keywords: Short-term memory. Echo-state networks. Recurrent neural networks. Random matrix
theory. Mean-field theory. Information


1. Introduction 

The ability of the neural representation of a signal to depend on its past context and the capacity
of short-term memory are essential for most perceptive and cognitive processes, from vision to
language processing and decision making [291 HU El E]. Although adaptation and plasticity may
play a crucial role in shaping short-term memory, we study here the hypothesis that this phenomenon
may also be explained by dynamical network effects, in particular due to the recurrent nature of
neural network connectivity nil mini [32]. To study this question from a theoretical standpoint,
we consider the following classical model of recurrent neural network (RNN) |^ HI EHl I2H], often
called the rate model:


$$x(t+1) = S\big(W x(t) + V u(t) + \eta(t)\big)$$

where $x(t) \in \mathbb{R}^n$ describes the states of the $n$ neurons, $u(t) \in \mathbb{R}^m$
represents the signal, $S(\cdot)$ is typically a sigmoid function (e.g. $\tanh(\cdot)$), $V$ is an
$n \times m$ matrix projecting the input into the recurrent network, and $W$ is an $n \times n$
matrix representing the internal connectivity of the recurrent network ($W_{ij}$ is the connection
strength from neuron $j$ to neuron $i$, and is often called synaptic weight). We will make various
assumptions on the matrices $V$ and $W$, and in particular we will investigate the disordered system
where the $V_{ij}$ are independent $\mathcal{N}(0, \kappa^2/m)$, and the $W_{ij}$ are either
independent $\mathcal{N}(0, \sigma^2/n)$ (random asymmetric connectivity) or constrained to be
symmetric with $W_{ij} = W_{ji} \sim \mathcal{N}(0, \sigma^2/4n)$ (random symmetric connectivity).
The parameter $\sigma$ characterizes the synaptic weight heterogeneity and is one of the most
important parameters in this study because it controls the amount of recurrence in the generation
of the representation. The variables $\eta(t)$ are independent centered Gaussian random vectors in
$\mathbb{R}^n$ with diagonal covariance $\varepsilon^2 I$. In this model, we have assumed a form of
"network noise" that is associated with the network dynamics. However, it is also possible to
consider at least two other natural sources of noise, namely an "input noise" in which
$u(t) + \eta(t)$ replaces $u(t)$, or an "output noise" in which the observation of the network
state $x(t)$ is perturbed by a measurement noise, giving $x(t) + \eta(t)$.
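As an illustration of the model above, the following minimal sketch draws the two connectivity ensembles and iterates the rate equation with network noise. It is not the authors' simulation code: the function names, the network size, the choice of tanh, and the name `kappa` for the input weight scale are assumptions made here for the example.

```python
import math
import random

def make_weights(n, m, sigma, kappa, rng, symmetric=False):
    """Draw W (n x n) and V (n x m) as nested lists (hypothetical helper)."""
    W = [[0.0] * n for _ in range(n)]
    if symmetric:
        # Symmetric ensemble: W_ij = W_ji with variance sigma^2 / (4n).
        sd = sigma / math.sqrt(4 * n)
        for i in range(n):
            for j in range(i, n):
                W[i][j] = W[j][i] = rng.gauss(0.0, sd)
    else:
        # Asymmetric ensemble: independent entries with variance sigma^2 / n.
        sd = sigma / math.sqrt(n)
        W = [[rng.gauss(0.0, sd) for _ in range(n)] for _ in range(n)]
    V = [[rng.gauss(0.0, kappa / math.sqrt(m)) for _ in range(m)] for _ in range(n)]
    return W, V

def step(W, V, x, u, eps, rng, S=math.tanh):
    """One update x(t+1) = S(W x(t) + V u(t) + eta(t)) with network noise."""
    return [
        S(sum(w * xj for w, xj in zip(W[i], x))
          + sum(v * uj for v, uj in zip(V[i], u))
          + rng.gauss(0.0, eps))
        for i in range(len(x))
    ]

rng = random.Random(0)
W, V = make_weights(n=20, m=1, sigma=0.9, kappa=1.0, rng=rng)
x = [0.0] * 20
for t in range(50):
    # Drive the network with a one-dimensional white-noise input.
    x = step(W, V, x, [rng.gauss(0.0, 1.0)], eps=0.1, rng=rng)
```

Input noise or output noise could be sketched the same way by moving the `rng.gauss` term onto `u` or onto the returned state.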

Most attempts to study short-term memory theoretically in similar random neural network
models have focused on the linear case $S(x) = x$ because of the difficulty of handling this
question in the non-linear regime. In the context of machine learning applications, [2Zj has shown
the ability of this system to reconstruct faithfully up to $n$ time-steps in the past. In [22], Fisher
Information was used to assess how short-term memory depends upon the connectivity structure
and the network size (see below). More recently, [TTl 12^ have studied the relevance of applying
compressed sensing concepts to this problem, showing the possibility for very long short-term
memory, scaling exponentially with $n$, for the recovery of sparse signals.

Before investigating the interplay between non-linearity and connectivity properties, which
is the principal objective of the present article, we study statistical and information-theoretic
measures (section 2.1) of the dependence of $x(t)$ upon the past stimulus $u(t - k)$ for the linear
model, and further introduce a new measure, the context capacity (section 2.2), that will be easier
to study in the non-linear case (section 3).


2. Linear model S(x) = x

Preliminary results. For simplicity, we consider in this section the case of 1D input time-series,
i.e. $m = 1$ and $V = v \in \mathbb{R}^n$. We introduce the transfer matrix:

$$M = [v, Wv, W^2 v, \dots, W^{T-1} v]$$


With this notation, in the absence of noise $x(T)$ can be written as $x(T) = Mu$ where $u$ is the
$T$-dimensional vector $(u(0), \dots, u(T-1))$. Then, in the case of network noise
$x(T) = Mu + \sum_{k=0}^{T-1} W^k \eta(T-1-k)$, in the case of input noise $x(T) = M(u + \eta)$,
and finally in the case of output noise $x(T) = Mu + \eta$. From these expressions, it is possible
to estimate the covariance structure of $x(T)$, assuming that the time-series $u$ is a Gaussian
process with zero mean and covariance matrix $C$:

$$\mathbb{E}\left[x(T)x(T)'\right] = MCM' + \Omega$$

where $\Omega$ is defined according to the type of noise considered:


• Network noise: $\Omega = \varepsilon^2 \sum_{k=0}^{T-1} W^k W'^k$

• Input noise: $\Omega = \varepsilon^2 MM'$

• Output noise: $\Omega = \varepsilon^2 I$.
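The identity $x(T) = Mu$ can be checked numerically. The sketch below builds the columns of the transfer matrix and compares the product with a direct simulation of the noiseless linear recursion. The helper names and sizes are illustrative, and the ordering of $u$ by lag (entry $k$ multiplies $W^k v$) is an assumption made here so that the product matches the simulation.

```python
import random

def matvec(W, x):
    """Matrix-vector product for nested lists."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def transfer_columns(W, v, T):
    """Columns v, Wv, ..., W^(T-1) v of the transfer matrix M."""
    cols, c = [], list(v)
    for _ in range(T):
        cols.append(c)
        c = matvec(W, c)
    return cols

rng = random.Random(1)
n, T = 8, 5
W = [[rng.gauss(0.0, 0.9 / n ** 0.5) for _ in range(n)] for _ in range(n)]
v = [rng.gauss(0.0, 1.0) for _ in range(n)]
inputs = [rng.gauss(0.0, 1.0) for _ in range(T)]  # inputs in presentation order

# Direct simulation of x(t+1) = W x(t) + v u(t), starting from x(0) = 0.
x = [0.0] * n
for u_t in inputs:
    x = [a + vi * u_t for a, vi in zip(matvec(W, x), v)]

# x(T) = M u, where entry k of u is the input presented k+1 steps before T.
u_lag = inputs[::-1]
cols = transfer_columns(W, v, T)
x_from_M = [sum(cols[k][i] * u_lag[k] for k in range(T)) for i in range(n)]
max_err = max(abs(a - b) for a, b in zip(x, x_from_M))
```

The two computations agree up to floating-point rounding, so `max_err` is tiny.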
2.1. Some statistical and information-theoretic measures

From the expression $x(T) = Mu$ in the absence of noise, one observes readily that when $T = n$,
if the square matrix $M$ is invertible (which is almost surely the case with random connectivity
matrices), then the time-series $u$ of length $T$ can be recovered exactly from the observation of
$n = T$ neurons. However, this basic result does not take into account the impact of noise on the
representation. In other words, if the matrix $M$ is not well conditioned, then a small perturbation
can lead to huge errors.

The Cramér-Rao bound states that the variance of the reconstruction error for any unbiased estimator
of $u$ cannot be smaller than the inverse of the Fisher Information, therefore providing a universal
bound for the input recovery problem. This measure of short-term memory has been studied in
a remarkable work by [22] for the linear model with network noise, where it was shown that the
Fisher matrix is given by

$$\mathcal{J}_{k,l} = v' W'^{k} \Omega^{-1} W^{l} v$$

Here, the diagonal element $\mathcal{J}_{k,k}$ is the Fisher information that $x(t)$ contains about
the input at time $t - k$, and characterizes the memory decay of the network representation.
Interestingly, this measure of memory is independent of the input and characterizes only the
recurrent neural network. After rescaling by the noise level,
$\tilde{\mathcal{J}} = \varepsilon^2 \mathcal{J}$, one defines the total memory
$J_{tot} = \mathrm{Tr}(\tilde{\mathcal{J}})$, which satisfies the following fundamental distinction:
if $W$ is a normal matrix, then $J_{tot} = 1$, and otherwise $J_{tot} < n$ and may behave extensively
with network size. This result shows that the underlying structure of the connectivity $W$ may have
a profound impact on the dependence of the representation upon past context. In particular, random
symmetric (normal) connectivity matrices appear to be less efficient than asymmetric (non-normal)
ones in terms of short-term memory.

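To further quantify this statement, one can evaluate the mutual information between $x(T)$ and
$u$:

$$\begin{aligned}
I(x(T); u) &= \frac{1}{2} \log \frac{\det(\Omega + MCM')}{\det \Omega} \\
&= \frac{1}{2} \log \det\left(I_n + (\Omega^{-1/2}M)\, C\, (\Omega^{-1/2}M)'\right) \\
&= \frac{1}{2} \log \det\left(I_T + (MC^{1/2})'\, \Omega^{-1}\, (MC^{1/2})\right)
\end{aligned}$$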


which is related to the Fisher information matrix $\mathcal{J} = M'\Omega^{-1}M$ by¹

$$I(x(T); u) = \frac{1}{2} \log \det\left(I_T + C^{1/2} \mathcal{J} C^{1/2}\right)$$
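To make the determinant formula concrete, here is a small numerical sketch for the output-noise setting with $C = \mu^2 I$ and $\Omega = \varepsilon^2 I$, in which case the argument of the determinant reduces to $I_T + (\mu^2/\varepsilon^2) M'M$. The helper names, the sizes and the parameter values are assumptions chosen for illustration only.

```python
import math
import random

def transfer_gram(W, v, T):
    """Gram matrix M'M of the transfer matrix M = [v, Wv, ..., W^(T-1) v]."""
    cols, c = [], list(v)
    for _ in range(T):
        cols.append(c)
        c = [sum(w * cj for w, cj in zip(row, c)) for row in W]
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def logdet_spd(A):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    logdet = 0.0
    for j in range(n):
        L[j][j] = math.sqrt(A[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        logdet += 2.0 * math.log(L[j][j])
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return logdet

def mutual_information(W, v, T, mu, eps):
    """0.5 * log det(I_T + (mu/eps)^2 M'M): output noise, C = mu^2 I."""
    G = transfer_gram(W, v, T)
    snr = (mu / eps) ** 2
    A = [[(1.0 if i == j else 0.0) + snr * G[i][j] for j in range(T)] for i in range(T)]
    return 0.5 * logdet_spd(A)

rng = random.Random(0)
n, T, sigma = 80, 5, 0.9
v = [rng.gauss(0.0, 1.0) for _ in range(n)]
W_asym = [[rng.gauss(0.0, sigma / math.sqrt(n)) for _ in range(n)] for _ in range(n)]
W_sym = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i, n):
        W_sym[i][j] = W_sym[j][i] = rng.gauss(0.0, sigma / math.sqrt(4 * n))

mi_asym = mutual_information(W_asym, v, T, mu=1.0, eps=0.1)
mi_sym = mutual_information(W_sym, v, T, mu=1.0, eps=0.1)
```

With these illustrative values the asymmetric draw yields the larger mutual information.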


In FIG. 2 (left), the mutual information is displayed as a function of $\sigma$ for both the
symmetric and asymmetric models, showing a clear superiority of the asymmetric model. To understand
this observation theoretically, it is possible to use random matrix theory to evaluate the mutual
information for $n \to \infty$, assuming $C = \mu^2 I$ and an output noise setting, for which
$\Omega = \varepsilon^2 I$. When $W$ is random asymmetric then

$$I_{asym}(x(T); u) \sim \bar{I}_{asym} = \frac{1}{2} \sum_{k=0}^{T-1} \log\left(1 + n \frac{\mu^2}{\varepsilon^2} \sigma^{2k}\right)$$

¹ Notice that a similar formula can be found in [22].


while the case of symmetric connectivity yields

$$I_{sym}(x(T); u) \sim \bar{I}_{sym} = \frac{1}{2} \log \det\left(I_T + n \frac{\mu^2}{\varepsilon^2} L\right)$$

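where $L$ is a checkerboard matrix filled with rescaled Catalan numbers:

$$L_{ij} = C_{p-1} \left(\frac{\sigma}{2}\right)^{2(p-1)} \text{ if } i + j = 2p, \text{ with } C_p = \frac{1}{p+1}\binom{2p}{p}, \quad \text{and } L_{ij} = 0 \text{ if } i + j \text{ is odd.}$$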
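The checkerboard structure of $L$ is easy to see on a small example. The sketch below assumes 1-based indices $i, j$ and the rescaling by $(\sigma/2)^{2(p-1)}$, which corresponds to the even moments of a semi-circular law supported on $[-\sigma, \sigma]$; the function names are hypothetical.

```python
from math import comb

def catalan(p):
    """Catalan number C_p = binom(2p, p) / (p + 1)."""
    return comb(2 * p, p) // (p + 1)

def checkerboard(T, sigma):
    """T x T matrix with L_ij = C_{p-1} (sigma/2)^(2(p-1)) when i + j = 2p, else 0.

    Indices i, j run from 1 to T in the formula and are stored 0-based here.
    """
    L = [[0.0] * T for _ in range(T)]
    for i in range(1, T + 1):
        for j in range(1, T + 1):
            if (i + j) % 2 == 0:
                p = (i + j) // 2
                L[i - 1][j - 1] = catalan(p - 1) * (sigma / 2) ** (2 * (p - 1))
    return L

L = checkerboard(4, sigma=1.0)
```

Entries with $i + j$ odd vanish because the odd moments of a symmetric eigenvalue distribution are zero.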

Indeed, to prove this result one only needs to evaluate the entries of the matrix $M'M$. For random
asymmetric matrices, in the large $n$ limit:

$$\frac{1}{n}(M'M)_{ij} = \frac{1}{n} v' W'^{\,i-1} W^{j-1} v \to \delta_{ij}\, \sigma^{2(i-1)}$$

whereas for random symmetric matrices:

$$\frac{1}{n}(M'M)_{ij} = \frac{1}{n} v' W^{i+j-2} v \to L_{ij}$$

where the Catalan numbers arise as the even moments of the semi-circular law.

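The above determinant appears to be rather difficult to compute analytically; however, it is still
possible to compare the mutual information for these two cases. Applying the Hadamard inequality
to the determinant in $\bar{I}_{sym}$:

$$\det\left(I_T + n \frac{\mu^2}{\varepsilon^2} L\right) \le \prod_{k=0}^{T-1} \left(1 + n \frac{\mu^2}{\varepsilon^2} C_k \left(\frac{\sigma}{2}\right)^{2k}\right)$$

and using the inequality $C_p \le 4^p$, one concludes that the mutual information for the symmetric
linear RNN is smaller than the one for the asymmetric case:

$$\bar{I}_{sym} \le \bar{I}_{asym}.$$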


In fact, for any symmetric connectivity matrix with an eigenvalue distribution compactly supported
in $[-\sigma, \sigma]$ (not only the semi-circular law), the same conclusion remains valid.
Therefore, we have shown in this section that, in the linear model, RNNs with symmetric
connectivity capture less information about the past inputs than asymmetric ones.
2.2. Context capacity


All the previous estimates relied heavily on the linear relationship between $x$ and $u$,
enabling the use of linear algebra and (random) matrix tools. However, as noticed for instance
in [22], extending the above analysis to non-linear models is very challenging. To circumvent this
difficulty, we introduce a new measure that quantifies how much the representation of a signal
depends upon its past context, and that is amenable to analysis in the non-linear setting. To
define our measure of context-dependence, we decompose the input time-series into two parts: the
context is the input from time $t = 1$ to $t = t_0$, and the signal is the input from time
$t = t_0 + 1$ up to some time $t = t_0 + \tau$. From this decomposition, we first define the context
sensitivity $\chi$ as follows (FIG. 1, left panel):

(a) we consider the context to be randomly generated according to a given probability law. At
each trial, an independent source of noise $\eta$ is also generated.

(b) the signal is kept fixed to a specific time-series $u$.

(c) we estimate the across-trial variance $\chi(\tau)$ of the representations $x(t_0 + \tau)$
obtained for each realization.

This variance can be explained by two sources of variability, namely the various contexts generated
before the signal and the presence of noise in the construction of the representation. In order to
normalize this variance with respect to the pure impact of noise, we introduce a measure of the
variability of the representation due to noise only, called the unreliability coefficient $\rho$
(FIG. 1, right panel):

(a) we consider the context to be a fixed time-series (say a fixed random sample generated according
to the same given probability law), whereas at each trial, an independent source of noise $\eta$ is
generated.

(b) the signal is kept fixed to a specific time-series $u$.

(c) we estimate the across-trial variance $\rho(\tau)$ of the representations $x(t_0 + \tau)$
obtained for each realization.

Finally, we are in a position to define the context capacity $C(\tau)$ as the ratio of the context
sensitivity over the unreliability coefficient:

$$C(\tau) = \frac{\chi(\tau)}{\rho(\tau)}$$

If the representation is context-independent, then $C(\tau) = 1$; the higher $C(\tau)$, the more
context-dependent the representation. Moreover, one expects $C(\tau)$ to be a decreasing function
of $\tau$. Before studying the non-linear system, we first evaluate the context capacity in the
linear case to study its relationship with the above statistical and information-theoretic measures.
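The two estimation procedures above can be sketched as a direct Monte Carlo computation. The code below is an illustrative sketch rather than the authors' implementation: the network size, the number of trials `K`, the tanh non-linearity, the zero initial state and the Gaussian law used for the contexts are all assumptions made for the example.

```python
import math
import random

def run(W, v, inputs, eps, rng, S=math.tanh):
    """Return the final state after driving the network with `inputs` (network noise)."""
    n = len(v)
    x = [0.0] * n
    for u in inputs:
        x = [S(sum(w * xj for w, xj in zip(W[i], x)) + v[i] * u + rng.gauss(0.0, eps))
             for i in range(n)]
    return x

def across_trial_variance(reps):
    """Variance across trials, averaged over neurons."""
    K, n = len(reps), len(reps[0])
    total = 0.0
    for i in range(n):
        mean = sum(r[i] for r in reps) / K
        total += sum((r[i] - mean) ** 2 for r in reps) / K
    return total / n

def context_capacity(W, v, signal, t0, K, eps, rng):
    """Estimate chi (random contexts), rho (one frozen context) and C = chi / rho."""
    frozen_context = [rng.gauss(0.0, 1.0) for _ in range(t0)]
    chi_reps, rho_reps = [], []
    for _ in range(K):
        random_context = [rng.gauss(0.0, 1.0) for _ in range(t0)]
        chi_reps.append(run(W, v, random_context + signal, eps, rng))
        rho_reps.append(run(W, v, frozen_context + signal, eps, rng))
    chi = across_trial_variance(chi_reps)
    rho = across_trial_variance(rho_reps)
    return chi / rho, chi, rho

rng = random.Random(0)
n, sigma = 30, 0.8
W = [[rng.gauss(0.0, sigma / math.sqrt(n)) for _ in range(n)] for _ in range(n)]
v = [rng.gauss(0.0, 1.0) for _ in range(n)]
signal = [rng.gauss(0.0, 1.0) for _ in range(2)]  # tau = 2, kept fixed across trials
cap, chi, rho = context_capacity(W, v, signal, t0=20, K=40, eps=0.05, rng=rng)
```

Averaging the variance over neurons instead of summing it does not change the ratio, since the same normalization applies to both $\chi$ and $\rho$.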

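We make the assumption that $t_0$ is large. To compute the context sensitivity, we consider that,
before time $t = t_0$, both the noise and the input are treated as random processes. Therefore
$x(t_0)$ is centered and has covariance $\mathrm{Cov}(x(t_0)) = MCM' + \Omega$. Then, at time
$T = t_0 + \tau$, $x(T)$ is no longer centered, and has covariance
$\mathrm{Cov}(x(T)) = W^{\tau}(MCM' + \Omega)W'^{\tau} + \Omega_{\tau}$ where
$\Omega_{\tau} = \varepsilon^2 \sum_{k=0}^{\tau-1} W^k W'^k$.

Since $t_0 \to \infty$, one has the identity $W^{\tau} \Omega W'^{\tau} + \Omega_{\tau} = \Omega$,
and the context sensitivity is given by

$$\chi = \mathrm{Tr}\left(W^{\tau} MCM' W'^{\tau} + \Omega\right)$$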
Figure 1: Schematic illustration of the definition of the unreliability coefficient $\rho$ and context sensitivity $\chi$.


To compute the unreliability coefficient, the input is always considered as deterministic, so the
covariance of $x(T)$ is $\mathrm{Cov}(x(T)) = \Omega$. Therefore, the unreliability coefficient is

$$\rho = \mathrm{Tr}(\Omega)$$

Finally, the context capacity is given by:

$$C(\tau) = 1 + \frac{\mathrm{Tr}\left(W^{\tau} MCM' W'^{\tau}\right)}{\mathrm{Tr}(\Omega)}$$


To investigate the impact of connectivity symmetry, we wish to analyze this formula from the
RMT point of view, assuming $C = \mu^2 I$. First we need the following trace lemma: in the limit
$n \to \infty$,

• for the asymmetric random model, $\frac{1}{n}\mathrm{Tr}(W^k W'^k) \to \sigma^{2k}$

• while for the symmetric random model, $\frac{1}{n}\mathrm{Tr}(W^k W'^k) \to C_k \left(\frac{\sigma}{2}\right)^{2k}$
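A quick finite-size check of the trace lemma can be sketched as follows. The matrix size, the seed and the pure-Python matrix product are choices made here for a self-contained example, and the symmetric limit is taken to be $C_k(\sigma/2)^{2k}$, the even moments of a semi-circular law supported on $[-\sigma, \sigma]$.

```python
import math
import random
from math import comb

def matmul(A, B):
    """Product of two square matrices stored as nested lists."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def normalized_trace(W, k):
    """(1/n) Tr(W^k (W^k)'), i.e. the squared Frobenius norm of W^k divided by n."""
    n = len(W)
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        P = matmul(P, W)
    return sum(x * x for row in P for x in row) / n

def catalan(k):
    return comb(2 * k, k) // (k + 1)

rng = random.Random(2)
n, sigma = 60, 0.9
W_asym = [[rng.gauss(0.0, sigma / math.sqrt(n)) for _ in range(n)] for _ in range(n)]
W_sym = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i, n):
        W_sym[i][j] = W_sym[j][i] = rng.gauss(0.0, sigma / math.sqrt(4 * n))

# Ratio of the empirical normalized trace to its predicted large-n limit.
ratios_asym = [normalized_trace(W_asym, k) / sigma ** (2 * k) for k in (1, 2)]
ratios_sym = [normalized_trace(W_sym, k) / (catalan(k) * (sigma / 2) ** (2 * k))
              for k in (1, 2)]
```

The ratios should approach one as $n$ grows; at this small size they are only expected to be close to one.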

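Since $t_0 \to \infty$, one first remarks that $W^{\tau} MM' W'^{\tau} = MM' - M_{\tau} M_{\tau}'$,
where $M_{\tau}$ is the extraction of the first $\tau$ columns of $M$. Considering first the case of
asymmetric connectivity, since $\mathrm{Tr}(MM') = \mathrm{Tr}(M'M)$ and since the diagonal terms of
$\frac{1}{n}M'M$ converge to $\sigma^{2(j-1)}$, then:

$$\mathrm{Tr}\left(W^{\tau} MCM' W'^{\tau}\right) \sim n \mu^2 \frac{\sigma^{2\tau}}{1 - \sigma^2}$$

It remains to evaluate $\mathrm{Tr}(\Omega)$, which can be done using the trace lemma, yielding:

$$\mathrm{Tr}(\Omega) \sim n \frac{\varepsilon^2}{1 - \sigma^2}$$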
Finally we obtain that when $n \to \infty$ and $\tau$ is fixed,

$$\lim_{n \to \infty} C_{asym}(\tau) = 1 + \frac{\mu^2}{\varepsilon^2} \sigma^{2\tau} =: \bar{C}_{asym}(\tau)$$

This formula is very simple and shows that the context capacity for the linear random asymmetric
RNN is determined by the product of the signal/noise ratio $\mu^2/\varepsilon^2$ and a geometrically
decaying term, which is maximal when $\sigma$ approaches one. Notice that the same product already
appeared in the expression of the mutual information $\bar{I}_{asym}$.

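The situation is, again, different for symmetric connectivity. Indeed, the diagonal terms of
$\frac{1}{n}M'M$ now converge to $C_{j-1}\left(\frac{\sigma}{2}\right)^{2(j-1)}$ and we apply the
trace lemma to obtain that when $n \to \infty$ and $\tau$ is fixed,

$$\lim_{n \to \infty} C_{sym}(\tau) = 1 + \frac{\mu^2}{\varepsilon^2} \frac{\Theta_{\tau}}{\Theta_0}$$

where, for $\sigma < 1$:

$$\Theta_{\tau} = \int_{-\sigma}^{\sigma} \frac{x^{2\tau}}{1 - x^2}\, p(x)\, dx$$

with $p(x)$ the density of the real eigenvalue distribution of $W$. In the case of the semi-circular
law, using the generating function of the Catalan numbers, one obtains:

$$\Theta_0 = \frac{2}{1 + \sqrt{1 - \sigma^2}} \quad \text{and} \quad \Theta_{\tau} = \Theta_0 - \sum_{k=0}^{\tau-1} \frac{C_k}{4^k} \sigma^{2k}$$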

As for the mutual information $\bar{I}_{sym}$, this formula involves Catalan numbers, and it is more
explicit because the evaluation of the trace is more straightforward than the evaluation of the
determinant.

From this result, using $C_k \le 4^k$, we deduce that the context capacity for the symmetric random
model is lower than its asymmetric counterpart: $C_{sym}(\tau) \le C_{asym}(\tau)$. Therefore the
context capacity behaves similarly to more standard measures such as the mutual information, and, as
we will see in the next section, offers an interesting alternative to investigate the interplay
between non-linearity and connectivity properties.
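The comparison between the two limiting capacities can be tabulated directly. The sketch below reads the limits as $1 + \mathrm{snr}\,\sigma^{2\tau}$ for the asymmetric model and $1 + \mathrm{snr}\,\Theta_\tau/\Theta_0$ for the symmetric model with a semi-circular spectrum; this reading of the formulas, the name `snr` for $\mu^2/\varepsilon^2$ and the numerical values are assumptions of the example.

```python
import math
from math import comb

def catalan(k):
    """Catalan number C_k = binom(2k, k) / (k + 1)."""
    return comb(2 * k, k) // (k + 1)

def capacity_asym(tau, sigma, snr):
    """Limiting capacity for the asymmetric linear model: 1 + snr * sigma^(2 tau)."""
    return 1.0 + snr * sigma ** (2 * tau)

def theta(tau, sigma):
    """Theta_tau for the semi-circular law, valid for sigma < 1."""
    theta0 = 2.0 / (1.0 + math.sqrt(1.0 - sigma ** 2))
    return theta0 - sum(catalan(k) * (sigma / 2) ** (2 * k) for k in range(tau))

def capacity_sym(tau, sigma, snr):
    """Limiting capacity for the symmetric linear model: 1 + snr * Theta_tau / Theta_0."""
    return 1.0 + snr * theta(tau, sigma) / theta(0, sigma)

sigma, snr = 0.9, 100.0
table = [(tau, capacity_asym(tau, sigma, snr), capacity_sym(tau, sigma, snr))
         for tau in range(6)]
# Truncated series for Theta_0, used as a sanity check of the closed form.
theta0_series = sum(catalan(k) * (sigma / 2) ** (2 * k) for k in range(300))
```

Both capacities start from the same value at $\tau = 0$, and the symmetric one decays faster with the delay.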


3. Non-linear model 

According to the mean-field theory [3H1 El EHl E], in particular in the case of input-driven systems
[361 ESI ED, $\sigma$ is the most important parameter in this system since it controls an
order-disorder phase transition between a "stimulus-driven regime" for $\sigma < \sigma_{critical}$
and a "chaotic regime" for $\sigma > \sigma_{critical}$. It has been argued that the regime close to
criticality in such systems may be relevant in terms of information processing capabilities, both in
the fields of neuroscience [2] and artificial intelligence 13 i-. Therefore, our first aim is to
understand how this heterogeneity parameter affects the context capacity in the disordered neural
network model. To measure the capacity $C$ we need to specify how we generate various contexts and
select a given signal. For simplicity, we
Figure 2: (Left) Mutual information as a function of $\sigma$ for the linear model. (Middle) Context
dependence capacity $C(\tau)$ as defined in (2.2), for the non-linear system with random asymmetric
connectivity, as a function of the standard deviation $\sigma$ of the synaptic weights, for different
values of the delay $\tau$. Points correspond to stochastic simulations with $n = 1000$ neurons, and
dotted lines correspond to theoretical predictions. (Right) Comparison of the context dependence
capacity $C(\tau = 5)$ with respect to the symmetry of the connectivity. All numerical simulations of
the non-linear model were done with $S(x) = \mathrm{erf}\left(\frac{\sqrt{\pi}}{2} x\right)$.


assume that the input $u(t)$ is a one-dimensional white noise process. In FIG. 2 (middle), we
display the context capacity $C(\tau)$ for different values of $\tau$ as a function of the synaptic
heterogeneity $\sigma$ for the random asymmetric model. As expected, the context capacity $C(\tau)$
is a decreasing function of $\tau$. More interestingly, it displays a maximal value for an
intermediate value of $\sigma > 1$, revealing a trade-off between recurrence-induced memory and
non-linear instabilities. To understand this observation from a theoretical standpoint, our strategy
is to develop a mean-field approximation of $C(\tau)$. To estimate the context sensitivity, we
consider for each trial $k \in \{1, \dots, K\}$ the solution $x^{(k)}$ of

$$x^{(k)}(t+1) = S\left(W x^{(k)}(t) + V u^{(k)}(t) + \eta^{(k)}(t)\right)$$

with the same initial condition $x(0)$ and where the $\eta^{(k)}$ are independent realizations of
the random process $\eta(t)$ defined above, and:

- for $1 \le t \le t_0$, the $u^{(k)}(t)$ are independent standard Gaussian variables, representing
various contexts,

- for $t_0 + 1 \le t \le t_0 + \tau$, for all trials, all the $u^{(k)}(t)$ are equal to $u(t)$,
which is a frozen realization of a white noise process.

In this problem, there are two different sources of randomness: the weight matrices $W$ and $V$
are randomly drawn once and for all, while the stochastic process $\eta$ and the various contexts
are drawn at each trial $k$. The first source of randomness, although frozen, will be responsible,
in the large $n$ limit, for a phenomenon of self-averaging, which is a well-established property of
the mean-field theory [Ml El [ID]. More precisely, population averages of the form
$\frac{1}{n}\sum_{i=1}^{n} f(x_i(t))$ converge to the expectation $\mathbb{E}_{W,V}[f(x_i(t))]$ over
the law of the pair $(W, V)$, which is actually a quantity independent of $i$.

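To estimate the across-trials variance of $x_i^{(k)}(t)$, we need to compute:

$$v_i(t) := \langle x_i(t)^2 \rangle_K - \langle x_i(t) \rangle_K^2$$

where $\langle z \rangle_K = \frac{1}{K}\sum_{k=1}^{K} z^{(k)}$ denotes the across-trials average.
Then we take the average over the neural population to obtain a scalar value $v(t) = [v_i(t)]_N$
where $[z]_N = \frac{1}{n}\sum_{i=1}^{n} z_i$ denotes the population average. Introducing the sample
covariance between two trials, $p_{k,l}(t) := \frac{1}{n}\sum_{i=1}^{n} x_i^{(k)}(t)\, x_i^{(l)}(t)$,
we can rewrite:

$$v(t) = \langle p_{k,k}(t) \rangle_K - \langle\langle p_{k,l}(t) \rangle\rangle_K$$

where $\langle\langle \cdot \rangle\rangle_K$ denotes the average over all pairs of trials $(k, l)$.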
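The rewriting of the population-averaged variance in terms of sample covariances between trials is an exact algebraic identity, which the following sketch verifies on random data. It reads the second term as the average of the sample covariance over all ordered pairs of trials, including $k = l$; the array sizes and variable names are arbitrary choices for the example.

```python
import random

rng = random.Random(3)
K, n = 6, 4  # number of trials and number of neurons
x = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(K)]  # x[k][i]

def sample_cov(k, l):
    """Sample covariance between trials k and l: (1/n) sum_i x_i^(k) x_i^(l)."""
    return sum(x[k][i] * x[l][i] for i in range(n)) / n

# Direct definition: across-trial variance of each neuron, then population average.
direct = 0.0
for i in range(n):
    mean = sum(x[k][i] for k in range(K)) / K
    mean_sq = sum(x[k][i] ** 2 for k in range(K)) / K
    direct += (mean_sq - mean ** 2) / n

# Rewriting: average of the diagonal terms minus average over all pairs (k, l).
trace_term = sum(sample_cov(k, k) for k in range(K)) / K
pair_term = sum(sample_cov(k, l) for k in range(K) for l in range(K)) / K ** 2
via_covariances = trace_term - pair_term
```

The two quantities `direct` and `via_covariances` coincide up to rounding, which is why the mean-field analysis can track the trace term and the pair term separately.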

First, the trace term $\gamma(t) := \langle p_{k,k}(t) \rangle_K$ can be obtained using classical
mean-field theory. Indeed,

$$\gamma(t+1) = \left\langle \left[ S\left(a^{(k)}(t)\right)^2 \right]_N \right\rangle_K$$

where the variables

$$a_i^{(k)}(t) = \left(W x^{(k)}(t) + V u^{(k)}(t) + \eta^{(k)}(t)\right)_i$$

are asymptotically (for large $n$) independent Gaussian random variables with common zero mean
and covariance:

$$(*) \quad \mathbb{E}\left(a_i^{(k)}(t)\, a_i^{(l)}(t)\right) = \mathbb{E}\Big(\sum_{a,b} W_{ia} W_{ib}\, x_a^{(k)}(t)\, x_b^{(l)}(t)\Big) + \mathbb{E}\Big(\sum_{a,b} V_{ia} V_{ib}\, u_a^{(k)}(t)\, u_b^{(l)}(t)\Big) + \mathbb{E}\left(\eta_i^{(k)}(t)\, \eta_i^{(l)}(t)\right)$$

Here, we denote by $\mathbb{E}(\cdot)$ the expectation with respect to the joint law of
$(W, V, \eta, [u]_1^{t_0})$. In fact, we only need to consider here the variance
$\mathbb{E}\left(a_i^{(k)}(t)^2\right)$, which depends neither on $i$, since we took the average over
$(W, V)$, nor on $k$, since we took the average over $\eta$ and $[u]_1^{t_0}$. When considering
$k = l$, the third term of the above sum is equal to $\varepsilon^2$ by assumption on $\eta$, and the
second one is given by $\kappa^2$. The first term is more problematic, since $W$ and $x$ could be, in
principle, correlated. However, and this is the key point in the mean-field theory, this dependence
decays when $n$ becomes very large, and in this asymptotic regime one can pretend $x$ and $W$ are
independent (see [Ml HQ] for a rigorous justification, and [33] for a recent exposition of the
application of the theory in the case of input-driven systems). Therefore, the first term in the sum
can be approximated by $\sigma^2 \gamma(t)$ and we obtain formally:
$\mathbb{E}\left(a_i(t)^2\right) = \sigma^2 \gamma(t) + \kappa^2 + \varepsilon^2$.

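Knowing the mean and variance of $a_i(t)$, since $x(t+1) = S(a(t))$, one can write a recurrence
equation, also called the mean-field equation:

$$\gamma(t+1) = F\left(\sigma^2 \gamma(t) + \kappa^2 + \varepsilon^2\right)$$

where

$$F(x^2) := \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} S(zx)^2\, e^{-z^2/2}\, dz.$$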


Then, the second term to compute is the average of the sample covariance, which we denote by
$\lambda(t)$. The situation is slightly different: since
$\lambda(t+1) = \langle\langle [S(a^{(k)}(t))\, S(a^{(l)}(t))]_N \rangle\rangle_K$, it appears here
that knowing the variance of $a_i^{(k)}(t)$ is not sufficient, and we further need to estimate the
covariance between $a_i^{(k)}(t)$ and $a_i^{(l)}(t)$ for $k \ne l$. The first term in the sum (*) is
again given by $\sigma^2 \lambda(t)$, and the second and third terms are equal to zero since all the
trials are independent. Therefore, we obtain:

$$\lambda(t+1) = G\left(\sigma^2 \lambda(t),\ \sigma^2 \gamma(t) + \kappa^2 + \varepsilon^2\right)$$

where:

$$G(x, y) := \mathbb{E}\left[S(z_1)\, S(z_2)\right], \quad (z_1, z_2) \sim \mathcal{N}\left(0, \Sigma(x, y)\right), \quad \Sigma(x, y) = \begin{pmatrix} y & x \\ x & y \end{pmatrix}$$

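We have obtained a deterministic dynamical system describing the variance across trials
$v(t) = \gamma(t) - \lambda(t)$, holding from time $t = 1$ up to time $t = t_0 - 1$:

$$(E) \quad \begin{cases}
\gamma(t+1) = F\left(\sigma^2 \gamma(t) + \kappa^2 + \varepsilon^2\right) \\
\lambda(t+1) = G\left(\sigma^2 \lambda(t),\ \sigma^2 \gamma(t) + \kappa^2 + \varepsilon^2\right)
\end{cases}$$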


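To compute the context sensitivity coefficient $\chi$, one needs to solve the above dynamical system
from time $1$ to $t_0 - 1$. Then, at time $t = t_0$, one switches to a system where
$u^{(k)}(t) = u(t)$ is now the signal and is the same for all trials, so that from time $t = t_0$ to
$t = t_0 + \tau$:

$$(E') \quad \begin{cases}
\gamma(t+1) = F\left(\sigma^2 \gamma(t) + \kappa^2 u(t)^2 + \varepsilon^2\right) \\
\lambda(t+1) = G\left(\sigma^2 \lambda(t) + \kappa^2 u(t)^2,\ \sigma^2 \gamma(t) + \kappa^2 u(t)^2 + \varepsilon^2\right)
\end{cases}$$

To summarize, after solving (E), followed by (E'), one obtains
$\chi(\tau) = \gamma(t_0 + \tau) - \lambda(t_0 + \tau)$.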

To compute the unreliability coefficient $\rho$, the only difference is that the context $u$ is now
frozen and does not change across trials. For each trial $k \in \{1, \dots, K\}$, we consider the
solution of

$$x^{(k)}(t+1) = S\left(W x^{(k)}(t) + V u(t) + \eta^{(k)}(t)\right)$$

Therefore, the above derivation remains valid, with only minor modifications, yielding a slightly
modified version of the mean-field dynamical system, for $t = 1$ to $t = t_0 - 1$:

$$(E'') \quad \begin{cases}
\gamma(t+1) = F\left(\sigma^2 \gamma(t) + \kappa^2 u(t)^2 + \varepsilon^2\right) \\
\lambda(t+1) = G\left(\sigma^2 \lambda(t) + \kappa^2 u(t)^2,\ \sigma^2 \gamma(t) + \kappa^2 u(t)^2 + \varepsilon^2\right)
\end{cases}$$

and for $t = t_0$ to $t = t_0 + \tau$ we obtain exactly the same system (E') defined above.
Therefore, after solving (E''), followed by (E'), one obtains
$\rho(\tau) = \gamma(t_0 + \tau) - \lambda(t_0 + \tau)$ and finally
$C(\tau) = \chi(\tau)/\rho(\tau)$. In FIG. 2 (middle), theoretical predictions are compared with
numerical simulations, showing a good agreement.
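The mean-field recipe can be sketched numerically by iterating the pair $(\gamma, \lambda)$ with simple Gaussian quadrature. This is an illustrative sketch under several assumptions made here: a trapezoid rule on $[-6, 6]$, the transfer function $S(x) = \mathrm{erf}(\sqrt{\pi}x/2)$, a zero initial state, unit-variance contexts contributing $\kappa^2$ during the random-context phase, and slightly simplified time indexing. The variable `lam` stands for the average sample covariance between trials.

```python
import math
import random

H = 0.2
GRID = [-6.0 + H * i for i in range(61)]
WEIGHTS = [H * math.exp(-z * z / 2) / math.sqrt(2 * math.pi) for z in GRID]

def S(x):
    return math.erf(math.sqrt(math.pi) / 2 * x)

def F(var):
    """E[S(a)^2] for a ~ N(0, var)."""
    s = math.sqrt(max(var, 0.0))
    return sum(w * S(s * z) ** 2 for z, w in zip(GRID, WEIGHTS))

def G(cov, var):
    """E[S(z1) S(z2)] for a centered Gaussian pair with variance var and covariance cov."""
    if var <= 0.0:
        return 0.0
    a = math.sqrt(var)
    b = cov / a
    c = math.sqrt(max(var - b * b, 0.0))
    total = 0.0
    for z1, w1 in zip(GRID, WEIGHTS):
        inner = sum(w2 * S(b * z1 + c * z2) for z2, w2 in zip(GRID, WEIGHTS))
        total += w1 * S(a * z1) * inner
    return total

def mean_field_capacity(sigma, kappa, eps, context, signal):
    """Iterate (gamma, lam) through the context and signal phases; return chi / rho."""
    def variance_after(phases):
        gamma = lam = 0.0
        for shared, total in phases:
            # shared: input variance common to all trials; total: full input variance.
            var = sigma ** 2 * gamma + total + eps ** 2
            cov = sigma ** 2 * lam + shared
            gamma, lam = F(var), G(cov, var)
        return gamma - lam

    def frozen(us):
        return [((kappa * u) ** 2, (kappa * u) ** 2) for u in us]

    chi = variance_after([(0.0, kappa ** 2)] * len(context) + frozen(signal))
    rho = variance_after(frozen(context) + frozen(signal))
    return chi / rho

rng = random.Random(0)
context = [rng.gauss(0.0, 1.0) for _ in range(20)]
signal = [rng.gauss(0.0, 1.0) for _ in range(2)]
cap = mean_field_capacity(sigma=0.8, kappa=1.0, eps=0.1, context=context, signal=signal)
```

Sweeping `sigma` and the length of `signal` in such a sketch gives curves of the same kind as the theoretical predictions discussed above, although no quantitative agreement with the figure is claimed here.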
In light of the results obtained for the linear model, the next natural question is to compare
the context capacity according to the connectivity properties, and in particular the symmetry
of $W$. So far, our theoretical results were based on mean-field theory, which relies heavily on
the assumption of independent coefficients $W_{ij}$ (asymmetric random model). Indeed, symmetry
introduces a large amount of dependence in the matrix $W$ and mean-field theory fails for the
symmetric random model [UlIIHlEn]. However, it is possible to evaluate the context capacity
numerically in the symmetric case: as shown in FIG. 2 (right), $C$ is much lower in the symmetric
case for small values of $\sigma$, corresponding to an "almost-linear" regime, in accordance with
results for the linear model, whereas it becomes much higher for larger values of $\sigma$, a regime
where the non-linear effects become prominent. Indeed, in this regime, the asymmetric model displays
chaotic dynamics, hence a poor context capacity due to a high unreliability, while the symmetric
model has an energy function [16], which prevents chaos and ensures a higher context capacity.
With this new observation, it appears that the subtle interplay between connectivity properties
and non-linearities is crucial for shaping the way neural networks remember their inputs, and that
studying the problem only from a linear algebra perspective may be misleading.

4. Discussion 

The problem of short-term memory in recurrent neural networks has been investigated using a
variety of models, from discrete-time networks EH 122], to continuous-time networks [2S1E1 and
spiking networks [301 SO]. Using various approaches, from statistics to information theory and
dynamical systems, existing literature has mainly focused on three major questions:

• How does memory relate to the number of nodes in the network?

Since [23], it has been shown that the relationship between the memory capacity and the
number $n$ of nodes in the network is essentially linear. Beyond this linear relationship between
the memory span and $n$, a recent study m has shown the ability of linear recurrent networks
to perform a compressed sensing operation and to achieve exponentially long memory for
sparse inputs, echoing ideas introduced in [23].

• What is the role of non-linearities?

While short-term memory in linear random recurrent networks has been studied extensively
in [221 ES], the case of non-linear models is less well understood, as an increase in the
amount of recurrence (e.g. through the spectral radius of the connectivity matrix) controls
simultaneously the memory and the amount of non-linearity in the representation. This trade-off
between memory and non-linearity has been investigated in particular from a theoretical
perspective in liai. which shows how these two components can be defined and measured,
and how it imposes constraints on the overall performance of reservoir computing systems.

• What is the impact of the connectivity structure?

As discussed in Section 2, the impact of connectivity structure on memory has been investigated
in [22], showing the importance of non-normality of $W$. In [^123], the specific case of
orthogonal connectivity matrices is also studied, showing the robustness of such structures
to noisy perturbations, a type of matrix also under consideration in [TT] to demonstrate the
RIP property. Furthermore, in a series of articles [35l EH EZl EQ], several authors have explored
the impact of connectivity structure in terms of prediction performance, showing that
simple deterministic connectivity, such as linear chains, may perform very well in various
tasks. The relationship between memory and performance is not straightforward as it may
be very task-dependent. However, a recent study na of the performance of linear ESN has
identified a connection between the Fisher memory curve of [22] and the mean-square-error
prediction performance. Finally, various works have been interested in understanding the interplay
between connectivity structure and autonomous non-linear reservoir dynamics (e.g.
[151113) but not from the perspective of understanding short-term memory properties.

The interplay between connectivity structure and non-linearity may have important consequences
for short-term memory, and theoretical studies of this problem remain scarce. The present
theoretical analysis of short-term memory and context-dependent representation in recurrent neural
networks, although limited by its modeling assumptions (choice of the dynamical system,
connectivity models), has contributed to advancing the understanding of this phenomenon:

1. Since [22], it is known that in linear models, the distinction between normal and non-normal 
connectivity matrices is very important. 

2. Similar results hold for other memory measures (mutual information and context capacity) in 
the linear case: random symmetric connectivities capture less information about the input. 

3. However, we have shown that this is no longer the case in non-linear models: in the non-linear
regime, symmetric random RNNs outperform the asymmetric model.

4. Mean-field theory is reaching its limits: we have shown how it can be used for the
asymmetric model, but it does not provide the key to unlock the symmetric one (or, more
generally, structured models).